## Read 1458644 rows and 11 (of 11) columns from 0.187 GB file in 00:00:10
## Read 625134 rows and 9 (of 9) columns from 0.066 GB file in 00:00:04
## [1] 0
## [1] 0
Since transport hubs are usually hot spots in a city, they may well provide useful information for predicting trip duration and improve prediction performance. We created new predictors based on the variables provided in the raw dataset.
Airports are not in the city center, so they may provide useful information for predicting long trip durations. Referring to prior work, we analyzed whether there is a significant difference between airport trips and non-airport trips.
Extending previous work on this problem, besides the airports we also analyzed the train station (Penn Station) and the bus station (Port Authority Bus Terminal). In contrast to the airports, the train and bus stations are in the city center, so they could be useful predictors for shorter trip durations.
Steps:

1. Get the coordinates of the transport hubs:
    - JFK airport: longitude = -73.778889, latitude = 40.639722
    - La Guardia airport: longitude = -73.872611, latitude = 40.77725
    - Train station (Penn Station): longitude = -73.993584, latitude = 40.750580
    - Bus station (Port Authority Bus Terminal): longitude = -73.9903, latitude = 40.7569
2. Calculate the direct distance from the pickup and dropoff locations to each transport hub.
3. Choose thresholds: using histograms of the distances from step 2, we chose a threshold for each hub; if a trip's distance to a hub is below the threshold, we label it a transport-hub trip (TRUE). Note that the thresholds differ across the four hubs.
4. Add new predictors: we created indicator variables corresponding to the thresholds from step 3.
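The distance-and-indicator steps above can be sketched in R as follows; this is a minimal illustration assuming a data frame named `train` with the raw `pickup_longitude`/`pickup_latitude` columns (the actual analysis may use a packaged helper such as `geosphere::distHaversine` instead):

```r
# Haversine distance in meters between two lon/lat points (vectorized)
hav_dist <- function(lon1, lat1, lon2, lat2, R = 6371000) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

# Steps 1-2: distance from each pickup to JFK
jfk_lon <- -73.778889; jfk_lat <- 40.639722
train$jfk_pickup_dist <- hav_dist(train$pickup_longitude, train$pickup_latitude,
                                  jfk_lon, jfk_lat)
# Steps 3-4: indicator using the 2000 m threshold chosen from the histogram
train$jfk_trip <- train$jfk_pickup_dist < 2000
```

The same pattern repeats for the dropoff side and for the other three hubs with their own thresholds.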
From the histograms for the four transport hubs, we obtained the thresholds for defining transport-hub trips. Because the airports are far from the center of New York, their thresholds are larger than those of the train and bus stations.
| Station | Threshold (meters) |
|---|---|
| JFK | 2000 |
| La Guardia | 2000 |
| Penn Station | 500 |
| Port Authority | 200 |
Then we used boxplots to check whether there is a significant difference in trip duration between transport-hub trips and non-transport-hub trips.
From the boxplots, we find a clear difference in trip duration between airport trips and non-airport trips. For the train and bus station trips, the medians differ only slightly, which is hard to see in the boxplots; however, the distribution of trip duration for the TRUE class is more concentrated. Also, in the interaction plots the slopes differ, and trip durations for TRUE train or bus trips are shorter.
Because transport hubs are hot spots in the city, we included these indicators in our model and checked whether they improve prediction accuracy.
We divided New York into 10 clusters by pickup location and by dropoff location, and calculated the frequency of pickups or dropoffs in each cluster. It is reasonable to assume that high-frequency clusters tend to have longer trip durations after controlling for distance. Before clustering, we first cleaned the training dataset to improve k-means performance.
Data Cleaning:
First, only trips with durations shorter than 22 * 3600 seconds are kept; longer ones are deleted. Here 22 is the number of hours and 3600 the number of seconds per hour, so the cutoff corresponds to a 22-hour trip, almost a whole day. Such trips rarely happen, so we treat them as outliers or mis-recorded data and delete them.
Then, trips with distances near 0 or durations shorter than 60 seconds are deleted. These cases mean the passenger did not actually take the taxi, or got out of the car almost immediately. Whatever the reason, such records are not useful for prediction and could act as outliers during training.
Also, we use the distance to JFK airport to separate airport-related trips from city trips: records within a distance of 3e5 from the airport are considered near-airport trips. These longer-distance trips serve as useful predictors for longer trips.
The final cleaning step keeps only trips with durations longer than 10 seconds and speeds below 100. These are reasonable assumptions based on real-world conditions, so mis-recorded data and very rare cases are deleted.
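The cleaning rules above can be summarized in a single dplyr filter; the column names `trip_duration` (seconds), `dist`, and `speed` are assumed here and may differ from the actual cleaning code:

```r
library(dplyr)

train_clean <- train %>%
  filter(trip_duration < 22 * 3600,  # drop trips longer than 22 hours
         trip_duration > 60,         # drop trips shorter than one minute
         dist > 0,                   # drop trips that effectively go nowhere
         speed < 100)                # drop implausible speeds
```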
| pcluster | count1 | freq1 | pickup_longitude | pickup_latitude |
|---|---|---|---|---|
| 1 | 59552 | 0.029 | -73.87045 | 40.77015 |
| 2 | 34905 | 0.017 | -73.96936 | 40.69272 |
| 3 | 186185 | 0.090 | -73.98038 | 40.77786 |
| 4 | 280200 | 0.135 | -73.95491 | 40.77355 |
| 5 | 414142 | 0.200 | -73.99368 | 40.73155 |
| 6 | 84589 | 0.041 | -73.95797 | 40.80258 |
| 7 | 422190 | 0.203 | -73.99084 | 40.75094 |
| 8 | 124174 | 0.060 | -74.00960 | 40.71323 |
| 9 | 46508 | 0.022 | -73.78427 | 40.64671 |
| 10 | 422952 | 0.204 | -73.97419 | 40.75755 |
| dcluster | count2 | freq2 | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|
| 1 | 11987 | 0.006 | -73.88554 | 40.87309 |
| 2 | 340333 | 0.164 | -73.95474 | 40.77246 |
| 3 | 25701 | 0.012 | -73.78055 | 40.66179 |
| 4 | 687603 | 0.331 | -73.98163 | 40.75577 |
| 5 | 85737 | 0.041 | -73.95692 | 40.68570 |
| 6 | 124657 | 0.060 | -74.01534 | 40.70538 |
| 7 | 175166 | 0.084 | -73.97701 | 40.78412 |
| 8 | 53708 | 0.026 | -73.87407 | 40.75878 |
| 9 | 88522 | 0.043 | -73.94702 | 40.81300 |
| 10 | 481983 | 0.232 | -73.99419 | 40.73357 |
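The cluster assignments and frequencies in the tables above can be produced with base R's `kmeans`; a sketch for the pickup side, with column names taken from the table headers:

```r
set.seed(1)  # k-means depends on the random start
km_pick <- kmeans(train_clean[, c("pickup_longitude", "pickup_latitude")],
                  centers = 10)
train_clean$pcluster <- km_pick$cluster

# per-cluster counts and relative frequencies, as in the table
freq_tab <- as.data.frame(table(pcluster = train_clean$pcluster))
freq_tab$freq1 <- freq_tab$Freq / nrow(train_clean)
```

The dropoff clusters (`dcluster`, `freq2`) are computed the same way from the dropoff coordinates.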
After clustering, we made a cluster map showing the frequency of pickups and dropoffs in different areas of New York. From the cluster map, we find that dropoff locations are more scattered than pickup locations; light blue indicates areas where more pickups or dropoffs happened.
Note: to make the cluster map clearer, we sampled 5000 observations from the training dataset.
Since distance has a large main effect on trip duration, it also tends to have large interactions. We assume the magnitude of its effect depends on the frequency level: in high-frequency dropoff or pickup areas, a one-unit increase in distance increases trip duration more than in low-frequency areas.
We made interaction plots to show this interaction effect. The lines clearly intersect, so we included interaction terms in our prediction model.
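One way to encode this interaction is a product term between distance and cluster frequency; the names `dist` and `freq1` follow the tables above, and the linear model here is only a sketch of the idea (the final model is xgboost):

```r
# explicit product term added as a predictor
train_clean$dist_freq1 <- train_clean$dist * train_clean$freq1

# equivalently, via the formula interface
fit <- lm(log(trip_duration + 1) ~ dist * freq1, data = train_clean)
```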
We analyzed traffic conditions in New York across different areas and times. The fastest routes start from the center of New York and end somewhere far from the center.
| cluster | speed center |
|---|---|
| 1 | 22.440065 |
| 2 | 7.885815 |
| 3 | 14.261755 |
| 4 | 36.176836 |
From the boxplots of the different pickup and dropoff clusters, we find that speed differs significantly across clusters. Speed also differs across times of day, but less markedly than across clusters.
| date | maximum temperature | minimum temperature | average temperature | precipitation | snow fall | snow depth | rain | s_fall | s_depth | all_precip | has_snow | has_rain | max_temp | min_temp | avg_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 | 42 | 34 | 38 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 42 | 34 | 38 |
| 2016-01-02 | 40 | 32 | 36 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 40 | 32 | 36 |
| 2016-01-03 | 45 | 35 | 40 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 45 | 35 | 40 |
| 2016-01-04 | 36 | 14 | 25 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 36 | 14 | 25 |
| 2016-01-05 | 29 | 11 | 20 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 29 | 11 | 20 |
| 2016-01-06 | 41 | 25 | 33 | 0.00 | 0.0 | 0 | 0 | 0 | 0 | 0 | FALSE | FALSE | 41 | 25 | 33 |
Analyze temperature
We can see a positive relationship between average trip duration and average temperature, so we treated temperature as a predictor.
## `geom_smooth()` using method = 'loess'
Referring to prior work, we imported another external dataset, provided by oscarleo, which estimates the fastest route for each trip using the Open Source Routing Machine (OSRM), and performed the necessary data manipulation. We compared the results with and without the fastest-route predictors; model performance improved after adding this external information.
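The fastest-route files can be joined onto the trips by the shared `id` key; a sketch assuming the file and column names published with the OSRM dataset (`total_distance`, `total_travel_time`, `number_of_steps`):

```r
library(data.table)

fastest <- rbind(fread("fastest_routes_train_part_1.csv"),
                 fread("fastest_routes_train_part_2.csv"))
train <- merge(train,
               fastest[, .(id, total_distance, total_travel_time,
                           number_of_steps)],
               by = "id", all.x = TRUE)
```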
## Read 700000 rows and 12 (of 12) columns from 0.247 GB file in 00:00:05
## Read 758643 rows and 12 (of 12) columns from 0.436 GB file in 00:00:07
## Read 625134 rows and 12 (of 12) columns from 0.293 GB file in 00:00:11
To validate the parameters, we treated the first 2/3 of the data as the training set and the rest as the test set, and evaluated the RMSE on the latter.
Referring to prior work, we replaced trip_duration with its logarithm (the +1 is added to avoid an undefined log(0)).
We treat month, wday, and hour as continuous variables (integers) instead of categorical variables, because this gave a smaller RMSE on the test dataset.
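The target transform and the integer time features can be derived as follows (assuming a `pickup_datetime` column in the raw data):

```r
# log transform of the target; +1 avoids log(0)
train$log_duration <- log(train$trip_duration + 1)

pick <- as.POSIXct(train$pickup_datetime, tz = "America/New_York")
train$month <- as.integer(format(pick, "%m"))  # 1-12
train$wday  <- as.integer(format(pick, "%u"))  # 1 = Monday, ..., 7 = Sunday
train$hour  <- as.integer(format(pick, "%H"))  # 0-23
```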
IMPROVEMENT
Compared to prior work on this problem, we added additional predictors:
We also added two more interaction terms:
Compared to using only the predictors from prior work, our model's performance improved: the RMSE on the test dataset is 0.396623 for the Kaggle kernel we referenced, versus 0.3381741 for our prediction model. The improvement in the xgboost model's performance is therefore substantial.
Note: the RMSE may differ slightly between runs of the xgboost model, but this does not affect our conclusion that we improved its performance.
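A hedged sketch of the xgboost setup that produces logs like those below; the hyperparameters shown are illustrative placeholders rather than the tuned values, and `X_train`/`y_train` stand for the feature matrix and log-duration target:

```r
library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(X_train), label = y_train)
dval   <- xgb.DMatrix(as.matrix(X_val),   label = y_val)

params <- list(objective = "reg:linear", eta = 0.1, max_depth = 6,
               subsample = 0.8, colsample_bytree = 0.8)

fit <- xgb.train(params, dtrain, nrounds = 150,
                 watchlist = list(val = dval, train = dtrain),
                 print_every_n = 25)
```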
## Starting parallelization in mode=socket with cpus=4.
## [1] val-rmse:4.222099 train-rmse:4.221609
## [26] val-rmse:0.346462 train-rmse:0.342320
## [51] val-rmse:0.340191 train-rmse:0.332611
## [76] val-rmse:0.339026 train-rmse:0.328157
## [101] val-rmse:0.338110 train-rmse:0.325425
## [126] val-rmse:0.337791 train-rmse:0.323799
## [150] val-rmse:0.337306 train-rmse:0.322699
## [1] 0.3373062